This vignette book documents the SeSAMe package in full to supplement the vignettes hosted on Bioconductor.
CRITICAL: Because sesame and sesameData are under active development, this documentation is specific to the following versions of sesame, sesameData and ExperimentHub:
## sesame sesameData ExperimentHub
## "1.13.35" "1.13.32" "2.3.5"
We recommend updating R, ExperimentHub, sesameData and sesame to these versions so that the documentation matches your installation. If you install directly from GitHub, you need to make sure that a compatible ExperimentHub is installed.
CRITICAL: After a new installation, one needs to cache the associated annotation data using the following command. This needs to be done only once per SeSAMe installation/update.
## snapshotDate(): 2022-03-01
This function caches the needed SeSAMe annotation for all the supported platforms. SeSAMe annotation data is managed by the sesameData package which uses the ExperimentHub infrastructure. You can find the location of the cached annotation data on your local computer using:
## [1] "/Users/zhouw3/Library/Caches/org.R-project.R/R/ExperimentHub"
The raw Infinium BeadChip data are stored in IDAT files. Each sample has two IDAT files, which correspond to the red and green signals respectively. Green and red files for the same sample should always share the same sample name prefix. For example, 204529320035_R06C01_Red.idat and 204529320035_R06C01_Grn.idat correspond to the red and green signals of one sample. In SeSAMe, we use the common prefix, i.e. 204529320035_R06C01, to refer to that sample. SeSAMe recognizes both raw IDATs and gzipped IDATs, which are common for data stored in GEO. For example, in addition to the example above, SeSAMe also recognizes GSM2178224_Red.idat.gz and GSM2178224_Grn.idat.gz.
The readIDATpair function reads in the signal intensity data from an IDAT pair. The function takes the common prefix as input and outputs a SigDF object. The SigDF object is simply an R data.frame with rows representing probes and columns representing different signal intensities and probe annotations. The SigDF class is discussed in more detail in a separate section below. Using the two examples above, one would run the following commands.
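The naming convention above can be illustrated with a few lines of base R. This is only a sketch of how a common prefix relates to the two channel files, using the file names from the text. It is not how SeSAMe itself resolves prefixes, and the regular expression is an assumption made for illustration.

```r
# File names following the naming convention described above
idat_files <- c(
  "204529320035_R06C01_Red.idat",
  "204529320035_R06C01_Grn.idat",
  "GSM2178224_Red.idat.gz",
  "GSM2178224_Grn.idat.gz"
)

# Strip the channel suffix and the optional .gz extension to get the prefix
prefixes <- unique(sub("_(Red|Grn)\\.idat(\\.gz)?$", "", idat_files))
print(prefixes)

# Check that every prefix has both a red and a green file
has_pair <- vapply(prefixes, function(p) {
  any(startsWith(idat_files, paste0(p, "_Red"))) &&
    any(startsWith(idat_files, paste0(p, "_Grn")))
}, logical(1))
print(has_pair)
```

Each prefix recovered this way is the string one would pass to the reading function for that sample.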
sdf = readIDATpair("idat_folder/204529320035_R06C01") # Example 1
sdf = readIDATpair("idat_folder/GSM2178224") # Example 2
Note that SeSAMe automatically detects and matches up the green and red signal files for the same sample. A SigDF can be regarded and treated as a regular data.frame:
One can summarize the resulting SigDF using the sesameQC_calcStats function (more QC can be found in the quality control vignette).
##
## =====================
## | Number of Probes
## =====================
## N. Probes : 866553 (num_probes)
## N. Inf.-II Probes : 724429 (num_probes_II)
## N. Inf.-I (Red) : 92192 (num_probes_IR)
## N. Inf.-I (Grn) : 49932 (num_probes_IG)
## N. Probes (CG) : 862927 (num_probes_cg)
## N. Probes (CH) : 2932 (num_probes_ch)
## N. Probes (RS) : 59 (num_probes_rs)
If you are dealing with a custom-made array instead of a standard array (MM285, EPIC, HM450 etc.) supported natively by SeSAMe, you need to provide a manifest that describes the probe information. You should be able to obtain the probe information manifest from the Illumina support website. The manifest should be formatted as a data frame with at least four columns: Probe_ID, M, U and col. An optional mask column may also be included as a default mask for the platform. The easiest way to format a SeSAMe-compatible manifest is to follow the internal manifest of a SeSAMe-supported platform. These can be retrieved with the sesameDataGet function:
The col is either G (which stands for Green) or R (which stands for Red) or 2 (which stands for Infinium II designs). For Infinium-II probes, the M column and col column are left as NA. For example, one can check that both the M and col columns are filled for the Infinium-I probes (in the mouse array these can be identified by a _[TBN][CON]1 suffix):
The last column mask is a logical vector that defines the default masking behavior of SeSAMe for the platform (see below for discussion of NA-masking). With the manifest, your data can be processed using the manifest= option in openSesame or readIDATpair (one sample).
In most cases, we work with a folder that contains many IDATs. This is where the searchIDATprefixes function is useful. It searches all the IDATs in a folder and its subfolders recursively. Combining this with R's looping functions lets you process many IDATs without having to specify every IDAT name. searchIDATprefixes returns a named vector of prefixes with associated Red and Grn files, which can be given to readIDATpair:
This returns a list of SigDFs, which is how the openSesame pipeline handles your data internally.
DNA methylation levels (aka β values) are defined as
β = M/(M + U)
M represents the signal from the methylated allele and U represents the signal from the unmethylated allele. β values can be retrieved by calling the getBetas function with the SigDF as input. The output is a named vector with probe IDs as names. For example, the following commands read in one sample and convert it to β values.
## cg00000029 cg00000103 cg00000109 cg00000155 cg00000158 cg00000165
## 0.8237945 0.2100515 0.8125637 0.9152265 0.9105163 0.8196466
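The definition β = M/(M + U) can be checked by hand on a few made-up intensities. The probe names and signal values below are hypothetical and only illustrate the arithmetic. getBetas additionally handles channel selection and masking, which this sketch ignores.

```r
# Toy methylated (M) and unmethylated (U) signal intensities (hypothetical)
M <- c(cg_a = 8000, cg_b = 500, cg_c = 3000)
U <- c(cg_a = 2000, cg_b = 4500, cg_c = 3000)

# Beta value: fraction of the total signal coming from the methylated allele
beta <- M / (M + U)
print(beta)
```

A mostly methylated probe (cg_a) lands near 1, a mostly unmethylated probe (cg_b) near 0, and an evenly split probe (cg_c) at 0.5.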
CRITICAL: getBetas takes a single SigDF object as input instead of a list of SigDFs. A common mistake is to merge multiple SigDFs with c(). To combine multiple SigDFs, use list() instead. To process many SigDFs, combine that with looping functions such as lapply or mclapply, or use the openSesame pipeline (see below).
β values for Infinium-I probes can also be obtained by summing up the in-band and out-of-band channel signals. This rescues probes where a SNP hits the extension base and hence switches the color channel. More details can be found in Zhou et al. 2017.
Experiment-dependent masking based on signal detection p-values is effective in excluding artifactual methylation level readings and probes with too much influence from signal background. We recommend the pOOBAH algorithm, which is based on Infinium-I probe out-of-band signal, for calibrating the distribution of the signal background:
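The difference between c() and list() is plain R behaviour and can be seen with ordinary data.frames. In the sketch below, the two data.frames and the toy_betas helper are hypothetical stand-ins for SigDFs and getBetas. They are only meant to show why c() loses sample boundaries.

```r
# Two toy data.frames standing in for per-sample SigDFs (hypothetical values)
s1 <- data.frame(Probe_ID = c("cg_a", "cg_b"), M = c(8000, 500), U = c(2000, 4500))
s2 <- data.frame(Probe_ID = c("cg_a", "cg_b"), M = c(6000, 1000), U = c(4000, 9000))

# c() flattens both data.frames into one list of columns
merged <- c(s1, s2)
length(merged)  # 6 columns, not 2 samples

# list() keeps one element per sample
samples <- list(sample1 = s1, sample2 = s2)

# Hypothetical stand-in for a per-sample beta computation
toy_betas <- function(df) setNames(df$M / (df$M + df$U), df$Probe_ID)

# Loop over samples; the result is a probe-by-sample matrix
betas <- sapply(samples, toy_betas)
print(betas)
```

The same pattern applies with lapply when a list of per-sample results is preferred over a matrix.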
## [1] 0
## [1] 37964
## [1] 37964
Sometimes one wants to calculate detection p-values without modifying the mask. For example, one may want to upload the p-values to GEO separately. In those cases one can use the return.pval option and add the p-value-based mask later.
SeSAMe implements background subtraction based on normal-exponential deconvolution using out-of-band probes (noob; Triche et al. 2013), optionally with a more aggressive subtraction (scrub). One can use the following β value distribution plot to see the effect of background subtraction. For example, the two (M and U) modes are further polarized.
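The idea of keeping p-values separate from the mask can be sketched in base R. The example below is a simplified illustration, not the pOOBAH implementation: it assumes a single vector of out-of-band intensities as background, one foreground signal per probe, and a 0.05 threshold. All values and probe names are made up.

```r
# Toy out-of-band (background) intensities and per-probe foreground signals
oob <- c(100, 120, 150, 180, 200, 220, 250, 300, 400, 500)
signal <- c(cg_a = 5000, cg_b = 210, cg_c = 90, cg_d = 450)

# Empirical p-value: fraction of background values above the probe signal
bg_ecdf <- ecdf(oob)
pvals <- 1 - bg_ecdf(signal)
names(pvals) <- names(signal)
print(pvals)

# The p-values can be stored or exported first, and the mask added later
mask <- rep(FALSE, length(signal))
mask <- mask | (pvals > 0.05)
print(mask)
```

Only the probe whose signal clearly exceeds the background (cg_a) stays unmasked in this toy data.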
par(mfrow=c(2,1), mar=c(3,3,2,1))
sesameQC_plotBetaByDesign(sdf, main="Before", xlab="\\beta")
sesameQC_plotBetaByDesign(noob(sdf), main="After", xlab="\\beta")
Dye bias refers to the difference in signal intensity between the two color channels. SeSAMe offers two flavors of dye bias correction: linear scaling (dyeBiasCorr) and nonlinear scaling (dyeBiasCorrTypeINorm). Linear scaling equalizes the mean signal of all probes from the two color channels.
par(mfrow=c(1,2))
sesameQC_plotRedGrnQQ(dyeBiasCorr(sdf), main="Before") # linear correction
sesameQC_plotRedGrnQQ(dyeBiasNL(sdf), main="After") # nonlinear correction
Residual dye bias can be corrected using nonlinear quantile interpolation with Type-I probes. Under this correction, Infinium-I Red probes and Infinium-I Grn probes have the same distribution of signal. Note that linear scaling does not shift beta values of Type-I probes while nonlinear scaling does shift beta values of Type-I probes.
Sometimes the Infinium-I channel specification in the manifest is inaccurate. We can infer the channel from the data.
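The linear flavor described above amounts to rescaling each channel by a constant so that the channel means agree. The base R sketch below uses made-up intensities and takes the average of the two channel means as the common target. Which probes and which reference level dyeBiasCorr actually uses is not shown here, so treat this as an assumption-laden illustration only.

```r
# Toy red and green channel intensities (hypothetical values)
red <- c(1200, 3400, 800, 5600)
grn <- c(600, 1700, 400, 2800)

# Common target: the average of the two channel means (an assumption)
target <- mean(c(mean(red), mean(grn)))

# One multiplicative factor per channel
red_adj <- red * target / mean(red)
grn_adj <- grn * target / mean(grn)

print(c(red = mean(red_adj), grn = mean(grn_adj)))
```

Because each channel is multiplied by a single factor, the ratio between any two signals within the same channel is unchanged, which is why this kind of scaling leaves Type-I beta values where they were.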
## Infinium-I color channel reset:
## R>R: 92127
## G>G: 49028
## R>G: 65
## G>R: 904
As one can see, most probes remain with the designated channel. A small fraction of the probes is considered “channel-switching”.
As you may have noticed, even with proper dye bias correction, there are still residual differences in β value distribution between Infinium-I and Infinium-II probes due to signal tail inflation. One solution is to perform ad-hoc quantile normalization. We provide a function similar to the BMIQ algorithm, with modifications (one-, two- and three-state mixtures, and Infinium-I to II matching, since in our experience Infinium-I is more often influenced by signal inflation), to match the Infinium-I and Infinium-II beta value distributions. To protect real biological signal, we do not recommend using this (or any such method) on all data; use it only when your data are known to be relatively well-behaved in methylation distribution. This function assumes that Infinium-I and Infinium-II probes are similar in beta value distribution in the unmethylated and methylated modes.
par(mfrow=c(2,1), mar=c(3,3,2,1))
sesameQC_plotBetaByDesign(sdf, main="Before", xlab="\\beta")
sesameQC_plotBetaByDesign(matchDesign(sdf), main="After", xlab="\\beta")
SeSAMe's design includes a lightweight, fully exposed infrastructure for the internal signal intensities. Central to this infrastructure is the SigDF data structure, which is a data.frame subclass. One can treat it like a regular data.frame with 7 specific columns, i.e., Probe_ID, MG, MR, UG, UR, col and mask. The col column specifies the color channel and takes G, R and 2. The Infinium-I probes carry G and R in col to indicate the designed color. This infrastructure is flexible enough to allow changes of color channel and of mask (the scope of usable probes), depending on the use of the array on different species, strains, populations etc. For example, the following data.frame operation lets you easily peek into the signal intensities.
Sometimes, particularly with older arrays, there might exist a controls attribute that contains the control probe information. In the new manifest, the control probes are parsed and included as regular probes (but with a ctl prefix in the probe ID). The control probe annotation can be found using the following function:
If you call getBetas as is, you may have noticed that some of the beta values are NA. This NA-masking is controlled internally using the mask column in SigDF. To check which probes will be NA-masked in a SigDF, one can use the mask function:
## [1] 0
## [1] 0
Please note that the mask in SigDF does not actually remove the probe reading but only specifies how SeSAMe currently views the measurement (as unreliable). One can add more probes to the mask with the addMask function. Other functions, such as the detection p-value calculation (e.g., pOOBAH), also modify the mask. NA-masking influences other normalization and preprocessing functions, so the mask should be set before running those preprocessing methods. No mask is set by default unless done through preprocessing functions. The qualityMask function does some recommended experiment-independent masking. For example, probes that cross-hybridize or are influenced by common polymorphisms are masked using this function. For more details on some of the masks in qualityMask and listAvailableMasks(sdf), one can refer to Zhou et al. 2017.
One can clear existing masking with the resetMask function.
## [1] 105454
## [1] 0
The getBetas function can also ignore NA-masking when extracting beta values by taking a mask=FALSE option:
## [1] 0
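The relationship between the mask column and NA values can be mimicked with a plain data.frame. In this sketch the data, the probe names and the toy_get_betas helper are all hypothetical. It only mirrors the behaviour described above (masked probes reported as NA unless masking is ignored) and is not the getBetas implementation.

```r
# Toy stand-in for a SigDF: signals plus a logical mask column (hypothetical)
toy <- data.frame(
  Probe_ID = c("cg_a", "cg_b", "cg_c"),
  M = c(8000, 500, 3000),
  U = c(2000, 4500, 3000),
  mask = c(FALSE, TRUE, FALSE)
)

# Hypothetical helper: masked probes become NA only when mask = TRUE
toy_get_betas <- function(df, mask = TRUE) {
  betas <- setNames(df$M / (df$M + df$U), df$Probe_ID)
  if (mask) betas[df$mask] <- NA
  betas
}

sum(is.na(toy_get_betas(toy)))                # 1: the masked probe is NA
sum(is.na(toy_get_betas(toy, mask = FALSE)))  # 0: the signal is still there
```

The underlying intensities are never deleted, which is why clearing or ignoring the mask brings the values back.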
In all, most of the masking comes from two major sources:
Experiment-dependent probe masking based on signal detection p-value (Zhou et al. 2018). Probes with a p-value higher than a threshold (default: 0.05) are masked (see the pOOBAH method for detection p-value calculation).
Experiment-independent probe masking due to design issues. This is typically designated in the mask column of the manifest (see Zhou et al. 2017). This masking supports EPIC, MM285, HM450 and HM27, is turned on by default, and can also be explicitly added with the qualityMask function:
A SigDF can be written to and read from plain-text files (e.g., tab-delimited and comma-delimited files) with the compliant column names (see above).
tsv_file_path = sprintf("%s/sigdf.tsv", tempdir())
sdf_write_table(sdf, file=tsv_file_path, sep="\t", quote=FALSE) # save as tsv
sdf2 = sdf_read_table(tsv_file_path) # read back
## Platform set to: EPIC
csv_file_path = sprintf("%s/sigdf.csv", tempdir())
sdf_write_table(sdf, file=csv_file_path, sep=",") # save as csv
sdf2 = sdf_read_table(csv_file_path, sep=",") # read back
## Platform set to: EPIC
Previously, the signal data were stored in SigSet, an S4 class that complies with Bioconductor guidelines. For backward compatibility, a SigSet can be transformed to a SigDF using the SigSetToSigDF function: sesame:::SigSetToSigDF(sset).
SigSet/SigDF can be converted to and from a minfi RGChannelSet in multiple ways. One can sesamize a minfi RGChannelSet, which returns a GenomicRatioSet. See sesamize for more detail.
SeSAMe provides functions to create QC plots. Some functions take sesameQC objects as input while others plot the SigDF objects directly. For example, the sesameQC_plotBar function takes a list of sesameQC objects and creates a bar plot for each metric calculated.
The fraction of detection failures reflects masking due to a variety of reasons, including failed detection, high background, putative low-quality probes etc. To compare samples in terms of detection success rate, one can use the sesameQC_plotBar function in the following way:
Dye bias is shown by an off-diagonal q-q plot of the red (x-axis) and green signal (y-axis).
Beta values are more influenced by signal background for probes with low signal intensities. The following plot shows this dependency and the extent of probes with low signal intensity.
Extra SNP allele frequencies can be obtained by summing up the methylated and unmethylated alleles of color-channel-switching probes. These allele frequencies can be combined with explicit SNP probes:
## rs10033147 rs1019916 rs1040870 rs10457834 rs10796216 rs10882854
## 0.4731873 0.9265673 0.4825240 0.4711608 0.9561655 0.9326652
## cg00038584 cg00408315 cg00413617 cg00488829 cg00519463 cg00523683
## 0.10148976 0.17358858 0.09358459 0.69068952 0.10698911 0.05029466
SeSAMe can extract explicit and Infinium-I-derived SNPs to identify potential sample swaps.
One can also output the allele frequencies and write a VCF file with genotypes. This requires additional SNP information (ref and alt alleles), which can be downloaded using the following code:
## Platform set to: EPIC
## Retrieving annotation from https://github.com/zhou-lab/InfiniumAnnotationV1/raw/main/Anno/EPIC/EPIC.hg19.snp_overlap_b151.rds... Done.
## Retrieving annotation from https://github.com/zhou-lab/InfiniumAnnotationV1/raw/main/Anno/EPIC/EPIC.hg19.typeI_overlap_b151.rds... Done.
One can write an actual VCF file with a header using formatVCF(sdf, vcf=path_to_vcf).
Infinium platforms are intrinsically robust to incomplete bisulfite conversion, as non-converted probes would fail to hybridize to the target. Residual incomplete bisulfite conversion can be quantified using the GCT score, which is based on C/T-extension probes. Details of this method can be found in Zhou et al. 2017. The closer the score is to 1.0, the more complete the bisulfite conversion.
## Platform set to: EPIC
## [1] 1.067769
## R Under development (unstable) (2021-11-09 r81170)
## Platform: x86_64-apple-darwin20.6.0 (64-bit)
## Running under: macOS Big Sur 11.6.2
##
## Matrix products: default
## BLAS: /Users/zhouw3/.Renv/versions/4.2.dev/lib/R/lib/libRblas.dylib
## LAPACK: /Users/zhouw3/.Renv/versions/4.2.dev/lib/R/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.0.8 sesame_1.13.35 sesameData_1.13.32
## [4] ExperimentHub_2.3.5 AnnotationHub_3.3.9 BiocFileCache_2.3.4
## [7] dbplyr_2.1.1 BiocGenerics_0.41.2 rmarkdown_2.11
## [10] R6_2.5.1
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-7 matrixStats_0.61.0
## [3] bit64_4.0.5 filelock_1.0.2
## [5] RColorBrewer_1.1-2 httr_1.4.2
## [7] GenomeInfoDb_1.31.4 tools_4.2.0
## [9] bslib_0.3.1 utf8_1.2.2
## [11] KernSmooth_2.23-20 DBI_1.1.2
## [13] colorspace_2.0-3 withr_2.5.0
## [15] tidyselect_1.1.2 base64_2.0
## [17] preprocessCore_1.57.0 bit_4.0.4
## [19] curl_4.3.2 compiler_4.2.0
## [21] cli_3.2.0 Biobase_2.55.0
## [23] RPMM_1.25 DelayedArray_0.21.2
## [25] labeling_0.4.2 sass_0.4.0
## [27] scales_1.1.1 readr_2.1.2
## [29] askpass_1.1 rappdirs_0.3.3
## [31] stringr_1.4.0 digest_0.6.29
## [33] illuminaio_0.37.0 XVector_0.35.0
## [35] pkgconfig_2.0.3 htmltools_0.5.2
## [37] MatrixGenerics_1.7.0 highr_0.9
## [39] fastmap_1.1.0 rlang_1.0.2
## [41] RSQLite_2.2.10 shiny_1.7.1
## [43] farver_2.1.0 jquerylib_0.1.4
## [45] generics_0.1.2 jsonlite_1.8.0
## [47] wheatmap_0.2.0 BiocParallel_1.29.15
## [49] RCurl_1.98-1.6 magrittr_2.0.2
## [51] GenomeInfoDbData_1.2.7 Matrix_1.4-0
## [53] Rcpp_1.0.8.2 munsell_0.5.0
## [55] S4Vectors_0.33.10 fansi_1.0.2
## [57] lifecycle_1.0.1 stringi_1.7.6
## [59] yaml_2.3.5 MASS_7.3-55
## [61] SummarizedExperiment_1.25.3 zlibbioc_1.41.0
## [63] plyr_1.8.6 grid_4.2.0
## [65] blob_1.2.2 parallel_4.2.0
## [67] promises_1.2.0.1 crayon_1.5.0
## [69] lattice_0.20-45 Biostrings_2.63.1
## [71] hms_1.1.1 KEGGREST_1.35.0
## [73] knitr_1.37 pillar_1.7.0
## [75] GenomicRanges_1.47.6 reshape2_1.4.4
## [77] stats4_4.2.0 glue_1.6.2
## [79] BiocVersion_3.15.0 evaluate_0.15
## [81] BiocManager_1.30.16 png_0.1-7
## [83] vctrs_0.3.8 tzdb_0.2.0
## [85] httpuv_1.6.5 openssl_2.0.0
## [87] gtable_0.3.0 purrr_0.3.4
## [89] assertthat_0.2.1 cachem_1.0.6
## [91] ggplot2_3.3.5 xfun_0.29
## [93] mime_0.12 xtable_1.8-4
## [95] later_1.3.0 tibble_3.1.6
## [97] AnnotationDbi_1.57.1 memoise_2.0.1
## [99] IRanges_2.29.1 cluster_2.1.2
## [101] ellipsis_0.3.2 interactiveDisplayBase_1.33.0
## [103] BiocStyle_2.23.1